30 research outputs found

    Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor

    We present progress on Joshua, an open-source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.
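
    Thrax itself runs as distributed Hadoop jobs, but the core hierarchical extraction step it performs (following Chiang, 2007) can be pictured on a single aligned sentence pair. The sketch below is a simplified single-machine illustration, not Thrax code; the function names, the phrase-consistency check, and the one-gap rule construction are our own simplifications.

    # Toy illustration of Chiang-style hierarchical rule extraction from one
    # aligned sentence pair. Simplified sketch, not Thrax's Hadoop implementation.

    def consistent_phrases(src, tgt, align, max_len=5):
        """Enumerate phrase pairs consistent with the word alignment."""
        pairs = []
        for i1 in range(len(src)):
            for i2 in range(i1, min(len(src), i1 + max_len)):
                # target positions linked to the source span [i1, i2]
                tgt_pos = [j for (i, j) in align if i1 <= i <= i2]
                if not tgt_pos:
                    continue
                j1, j2 = min(tgt_pos), max(tgt_pos)
                # consistency: nothing in [j1, j2] aligns outside [i1, i2]
                if all(i1 <= i <= i2 for (i, j) in align if j1 <= j <= j2):
                    pairs.append(((i1, i2), (j1, j2)))
        return pairs

    def hierarchical_rules(src, tgt, pairs):
        """Form rules with one gap by subtracting an inner phrase pair."""
        rules = set()
        for (i1, i2), (j1, j2) in pairs:
            rules.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
            for (k1, k2), (l1, l2) in pairs:
                if i1 <= k1 and k2 <= i2 and (k1, k2) != (i1, i2) \
                   and j1 <= l1 and l2 <= j2:
                    s = src[i1:k1] + ["[X,1]"] + src[k2 + 1:i2 + 1]
                    t = tgt[j1:l1] + ["[X,1]"] + tgt[l2 + 1:j2 + 1]
                    rules.add((" ".join(s), " ".join(t)))
        return rules

    src = "ne mange pas".split()
    tgt = "does not eat".split()
    align = [(0, 1), (1, 2), (2, 1)]   # ne-not, mange-eat, pas-not
    for lhs, rhs in sorted(hierarchical_rules(src, tgt, consistent_phrases(src, tgt, align))):
        print(f"[X] -> {lhs} ||| {rhs}")

    On this classic example the output includes the gapped rule "[X] -> ne [X,1] pas ||| not [X,1]" alongside the plain phrase pairs, which is the kind of rule a hierarchical grammar extractor produces.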

    Genie: A Generator of Natural Language Semantic Parsers for Virtual Assistant Commands

    To understand diverse natural language commands, virtual assistants today are trained with numerous labor-intensive, manually annotated sentences. This paper presents a methodology and the Genie toolkit that can handle new compound commands with significantly less manual effort. We advocate formalizing the capability of virtual assistants with a Virtual Assistant Programming Language (VAPL) and using a neural semantic parser to translate natural language into VAPL code. Genie needs only a small realistic set of input sentences for validating the neural model. Developers write templates to synthesize data; Genie uses crowdsourced paraphrases and data augmentation, along with the synthesized data, to train a semantic parser. We also propose design principles that make VAPL languages amenable to natural language translation. We apply these principles to revise ThingTalk, the language used by the Almond virtual assistant. We use Genie to build the first semantic parser that can support compound virtual assistant commands with unquoted free-form parameters. Genie achieves 62% accuracy on realistic user inputs. We demonstrate Genie's generality by showing a 19% and 31% improvement over the previous state of the art on a music skill, aggregate functions, and access control. (To appear in PLDI 2019.)
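
    The synthesis step Genie relies on can be pictured as expanding developer-written templates that pair a natural-language pattern with a program pattern, then filling the slots with parameter values. The sketch below is only a toy illustration of that idea; the template syntax, the slot values, and the ThingTalk-like program strings are invented for the example and are not Genie's actual DSL.

    # Toy sketch of template-based data synthesis in the spirit of Genie:
    # each template pairs an utterance pattern with a program pattern, and
    # slot values are expanded to produce (sentence, program) training pairs.
    # Template format and programs below are invented for illustration only.

    from itertools import product

    templates = [
        ("play {song} by {artist}",
         "@music.play(song={song!r}, artist={artist!r})"),
        ("when it rains, play {song}",
         "monitor @weather.rain() => @music.play(song={song!r})"),
    ]

    slot_values = {
        "song": ["Hey Jude", "Bohemian Rhapsody"],
        "artist": ["The Beatles", "Queen"],
    }

    def synthesize(templates, slot_values):
        """Expand every template over all combinations of its slot values."""
        examples = []
        for utt, prog in templates:
            slots = [s for s in slot_values if "{" + s + "}" in utt]
            for combo in product(*(slot_values[s] for s in slots)):
                binding = dict(zip(slots, combo))
                examples.append((utt.format(**binding), prog.format(**binding)))
        return examples

    for sentence, program in synthesize(templates, slot_values):
        print(sentence, "->", program)

    In the real toolkit the synthesized pairs are combined with crowdsourced paraphrases before training the neural parser; the expansion step above only shows where the bulk of the cheap training data comes from.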

    cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

    We present cdec, an open source framework for decoding, aligning with, and training a number of statistical machine translation models, including word-based models, phrase-based models, and models based on synchronous context-free grammars. Using a single unified internal representation for translation forests, the decoder strictly separates model-specific translation logic from general rescoring, pruning, and inference algorithms. From this unified representation, the decoder can extract not only the 1- or k-best translations, but also alignments to a reference, or the quantities necessary to drive discriminative training using gradient-based or gradient-free optimization techniques. Its efficient C++ implementation means that memory use and runtime performance are significantly better than comparable decoders.
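
    The unified internal representation the abstract describes is a translation forest, i.e. a hypergraph whose nodes are derivation states and whose hyperedges carry rule applications and weights; 1-best (Viterbi) and k-best extraction are then generic operations over that structure. Below is a minimal sketch of such a hypergraph with Viterbi extraction; the class and field names are ours and this mirrors the idea only, not cdec's C++ API.

    # Minimal sketch of a translation hypergraph ("forest") with Viterbi
    # (1-best) extraction. Names are ours, not cdec's.

    from dataclasses import dataclass, field

    @dataclass
    class Edge:
        head: int       # node this edge derives
        tails: list     # antecedent node ids (empty for leaves)
        weight: float   # log-probability of the rule application
        label: str      # target-side string produced by the rule

    @dataclass
    class Hypergraph:
        num_nodes: int
        edges: list = field(default_factory=list)

        def viterbi(self):
            """Best derivation score and string per node; edges are bottom-up."""
            best = [(float("-inf"), "")] * self.num_nodes
            for e in self.edges:
                score = e.weight + sum(best[t][0] for t in e.tails)
                text = " ".join([best[t][1] for t in e.tails] + [e.label]).strip()
                if score > best[e.head][0]:
                    best[e.head] = (score, text)
            return best[-1]     # root is assumed to be the last node

    # Tiny forest: two leaf translations compete, the root combines them.
    hg = Hypergraph(num_nodes=3, edges=[
        Edge(head=0, tails=[], weight=-0.2, label="the house"),
        Edge(head=0, tails=[], weight=-0.9, label="the home"),
        Edge(head=1, tails=[], weight=-0.1, label="is small"),
        Edge(head=2, tails=[0, 1], weight=-0.05, label=""),
    ])
    print(hg.viterbi())   # (-0.35, 'the house is small')

    Because model-specific logic only decides which edges go into the forest, the same traversal can be reused for k-best lists, alignment to a reference, or computing the statistics needed for discriminative training, which is the separation of concerns the abstract emphasizes.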

    Learning to Translate with Products of Novices: A Suite of Open-Ended Challenge Problems for Teaching MT

    Machine translation (MT) draws from several different disciplines, making it a complex subject to teach. There are excellent pedagogical texts, but problems in MT and current algorithms for solving them are best learned by doing. As a centerpiece of our MT course, we devised a series of open-ended challenges for students in which the goal was to improve performance on carefully constrained instances of four key MT tasks: alignment, decoding, evaluation, and reranking. Students brought a diverse set of techniques to the problems, including some novel solutions which performed remarkably well. A surprising and exciting outcome was that student solutions or their combinations fared competitively on some tasks, demonstrating that even newcomers to the field can help improve the state of the art on hard NLP problems while simultaneously learning a great deal. The problems, baseline code, and results are freely available.
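
    Challenges of this kind usually hand students a deliberately weak baseline to beat. Purely as an illustration of what such a constrained starting point might look like for the alignment task (the course's actual baseline code is not reproduced here), the sketch below links each source word to the target word with the highest Dice coefficient over sentence-level co-occurrence counts.

    # Hypothetical weak word-alignment baseline of the kind students might be
    # asked to improve on: align each source word to the target word with the
    # highest Dice coefficient. Not the course's actual baseline code.

    from collections import Counter

    def dice_align(bitext):
        """bitext: list of (source_tokens, target_tokens) sentence pairs."""
        s_count, t_count, st_count = Counter(), Counter(), Counter()
        for src, tgt in bitext:
            for s in set(src):
                s_count[s] += 1
            for t in set(tgt):
                t_count[t] += 1
            for s in set(src):
                for t in set(tgt):
                    st_count[(s, t)] += 1

        alignments = []
        for src, tgt in bitext:
            links = []
            for i, s in enumerate(src):
                scores = [(2 * st_count[(s, t)] / (s_count[s] + t_count[t]), j)
                          for j, t in enumerate(tgt)]
                best_score, j = max(scores)
                if best_score > 0:
                    links.append((i, j))
            alignments.append(links)
        return alignments

    bitext = [("das haus".split(), "the house".split()),
              ("das buch".split(), "the book".split()),
              ("ein buch".split(), "a book".split())]
    for links in dice_align(bitext):
        print(links)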

    Data-driven sentence simplification: Survey and benchmark

    Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common datasets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.
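
    Two of the transformations the survey groups approaches by, lexical replacement and sentence splitting, are easy to picture with a toy rule-based pass. The sketch below is an invented illustration, not any system surveyed in the article; the mini-lexicon and the "which"-clause heuristic are assumptions made for the example.

    # Toy illustration of two sentence-simplification transformations:
    # lexical replacement and sentence splitting. Rules and lexicon are
    # invented for the example, not taken from any surveyed system.

    import re

    SIMPLER = {
        "purchase": "buy",
        "approximately": "about",
        "utilize": "use",
    }

    def replace_lexical(sentence):
        """Swap complex words for simpler synonyms."""
        return " ".join(SIMPLER.get(t.lower(), t) for t in sentence.split())

    def split_on_which(sentence):
        """Split a non-restrictive 'which' clause into its own sentence."""
        m = re.match(r"(.+?), which (.+)\.$", sentence)
        if not m:
            return [sentence]
        main, clause = m.groups()
        # crude heuristic: reuse the last two words of the main clause as subject
        subject = " ".join(main.split()[-2:])
        return [main + ".", subject.capitalize() + " " + clause + "."]

    s = "The firm decided to purchase the building, which costs approximately two million dollars."
    for out in split_on_which(replace_lexical(s)):
        print(out)

    The data-driven systems the survey covers learn such rewrites from aligned original-simplified sentence pairs instead of hand-written rules, but the output they aim for has the same shape: simpler vocabulary and shorter, self-contained sentences.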

    PARADIGM: Paraphrase Diagnostics through Grammar Matching

    Paraphrase evaluation is typically done either manually or through indirect, task-based evaluation. We introduce an intrinsic evaluation, PARADIGM, which measures the goodness of paraphrase collections that are represented using synchronous grammars. We formulate two measures that evaluate these paraphrase grammars using gold standard sentential paraphrases drawn from a monolingual parallel corpus. The first measure calculates how often a paraphrase grammar is able to synchronously parse the sentence pairs in the corpus. The second measure enumerates paraphrase rules from the monolingual parallel corpus and calculates the overlap between this reference paraphrase collection and the paraphrase resource being evaluated. We demonstrate the use of these evaluation metrics on paraphrase collections derived from three different data types: multiple translations of classic French novels, comparable sentence pairs drawn from different newspapers, and bilingual parallel corpora. We show that PARADIGM correlates with human judgments more strongly than BLEU on a task-based evaluation of paraphrase quality.
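
    The second measure is, at heart, set overlap between a reference collection of paraphrase rules (extracted from the monolingual parallel corpus) and the rules in the collection under evaluation. The sketch below shows how such an overlap score might be computed; the flat "lhs ||| rhs" rule representation and the precision/recall/F1 formulation are our assumptions for illustration, not the paper's exact definition.

    # Hedged sketch of an overlap-style measure between a reference paraphrase
    # rule set and an evaluated collection, in the spirit of PARADIGM's second
    # measure. Rule format and P/R/F1 formulation are assumptions.

    def rule_overlap(reference_rules, candidate_rules):
        """Precision, recall, and F1 of candidate rules against the reference."""
        ref, cand = set(reference_rules), set(candidate_rules)
        hit = len(ref & cand)
        precision = hit / len(cand) if cand else 0.0
        recall = hit / len(ref) if ref else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    reference = {"the man ||| the gentleman",
                 "bought ||| purchased",
                 "a car ||| an automobile"}
    candidate = {"bought ||| purchased",
                 "a car ||| an automobile",
                 "quickly ||| rapidly"}
    print(rule_overlap(reference, candidate))   # (0.666..., 0.666..., 0.666...)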